{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# K-Means Demo\n",
    "\n",
    "KMeans is a basic but powerful clustering method which is optimized via Expectation Maximization. It randomly selects K data points in X, and computes which samples are close to these points. For every cluster of points, a mean is computed, and this becomes the new centroid.\n",
    "\n",
    "cuML’s KMeans supports the scalable KMeans++ intialization method. This method is more stable than randomnly selecting K points.\n",
    "    \n",
    "The model can take array-like objects, either in host as NumPy arrays or in device (as Numba or cuda_array_interface-compliant), as well as cuDF DataFrames as the input.\n",
    "\n",
    "For information about cuDF, refer to the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable).\n",
    "\n",
    "For additional information on cuML's k-means implementation: https://docs.rapids.ai/api/cuml/stable/api.html#cuml.KMeans."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cudf\n",
    "import cupy\n",
    "import matplotlib.pyplot as plt\n",
    "from cuml.cluster import KMeans as cuKMeans\n",
    "from cuml.datasets import make_blobs\n",
    "from sklearn.cluster import KMeans as skKMeans\n",
    "from sklearn.metrics import adjusted_rand_score\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_samples = 100000\n",
    "n_features = 2\n",
    "\n",
    "n_clusters = 5\n",
    "random_state = 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "device_data, device_labels = make_blobs(n_samples=n_samples,\n",
    "                                        n_features=n_features,\n",
    "                                        centers=n_clusters,\n",
    "                                        random_state=random_state,\n",
    "                                        cluster_std=0.1)\n",
    "\n",
    "device_data = cudf.DataFrame(device_data)\n",
    "device_labels = cudf.Series(device_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copy dataset from GPU memory to host memory.\n",
    "# This is done to later compare CPU and GPU results.\n",
    "host_data = device_data.to_pandas()\n",
    "host_labels = device_labels.to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scikit-learn model\n",
    "\n",
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "kmeans_sk = skKMeans(init=\"k-means++\",\n",
    "                     n_clusters=n_clusters,\n",
    "                     n_jobs=-1,\n",
    "                    random_state=random_state)\n",
    "\n",
    "kmeans_sk.fit(host_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## cuML Model\n",
    "\n",
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "kmeans_cuml = cuKMeans(init=\"k-means||\",\n",
    "                       n_clusters=n_clusters,\n",
    "                       oversampling_factor=40,\n",
    "                       random_state=random_state)\n",
    "\n",
    "kmeans_cuml.fit(device_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize Centroids\n",
    "\n",
    "Scikit-learn's k-means implementation uses the `k-means++` initialization strategy while cuML's k-means uses `k-means||`. As a result, the exact centroids found may not be exact as the std deviation of the points around the centroids in `make_blobs` is increased.\n",
    "\n",
    "*Note*: Visualizing the centroids will only work when `n_features = 2` "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(16, 10))\n",
    "plt.scatter(host_data.iloc[:, 0], host_data.iloc[:, 1], c=host_labels, s=50, cmap='viridis')\n",
    "\n",
    "#plot the sklearn kmeans centers with blue filled circles\n",
    "centers_sk = kmeans_sk.cluster_centers_\n",
    "plt.scatter(centers_sk[:,0], centers_sk[:,1], c='blue', s=100, alpha=.5)\n",
    "\n",
    "#plot the cuml kmeans centers with red circle outlines\n",
    "centers_cuml = kmeans_cuml.cluster_centers_\n",
    "plt.scatter(cupy.asnumpy(centers_cuml[0].values), \n",
    "            cupy.asnumpy(centers_cuml[1].values), \n",
    "            facecolors = 'none', edgecolors='red', s=100)\n",
    "\n",
    "plt.title('cuml and sklearn kmeans clustering')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compare Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "cuml_score = adjusted_rand_score(host_labels, kmeans_cuml.labels_.to_numpy())\n",
    "sk_score = adjusted_rand_score(host_labels, kmeans_sk.labels_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "threshold = 1e-4\n",
    "\n",
    "passed = (cuml_score - sk_score) < threshold\n",
    "print('compare kmeans: cuml vs sklearn labels_ are ' + ('equal' if passed else 'NOT equal'))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
